In this homework, the aim is to determine whether a molecule is a musk or a non-musk. The dataset consists entirely of integers and contains 476 instances with 166 features. It covers 47 musk and 45 non-musk molecules, which are classified according to their conformations. A molecule is defined as musk if at least one instance in its bag is "1"; if all instances are "0", it is non-musk.
library(data.table)
library(arules)
library(dplyr)
library(tidyverse)
library(ggbiplot)
library(ggplot2)
library(MASS)
library(jpeg)
library(imager)
library(magick)
##################TASK 1#########################
musk <- fread("Musk1.csv", stringsAsFactors = FALSE)
head(musk[,1:16])
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
## 1: 1 1 42 -198 -109 -75 -117 11 23 -88 -28 -27 -232 -212 -66 -286
## 2: 1 1 42 -191 -142 -65 -117 55 49 -170 -45 5 -325 -115 -107 -281
## 3: 1 1 42 -191 -142 -75 -117 11 49 -161 -45 -28 -278 -115 -67 -274
## 4: 1 1 42 -198 -110 -65 -117 55 23 -95 -28 5 -301 -212 -107 -280
## 5: 1 2 42 -198 -102 -75 -117 10 24 -87 -28 -28 -233 -212 -67 -286
## 6: 1 2 42 -191 -142 -65 -117 55 49 -170 -45 6 -324 -114 -106 -280
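The bag rule from the introduction (a molecule is musk if any instance in its bag is "1") can be illustrated with a minimal sketch. The toy data frame and its column names `label` and `bag` are hypothetical; which columns of `Musk1.csv` hold the label and the bag id is an assumption here, not something the report states.

```r
# Toy stand-in for the musk table (hypothetical columns: instance label, bag id)
toy <- data.frame(
  label = c(0, 1, 0, 0, 0),
  bag   = c(1, 1, 1, 2, 2)
)

# A bag is musk (1) if any of its instances is 1, otherwise non-musk (0)
bag_labels <- vapply(
  split(toy$label, toy$bag),
  function(x) as.integer(any(x == 1)),
  integer(1)
)

print(bag_labels)  # bag 1 -> 1 (musk), bag 2 -> 0 (non-musk)
```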
For the PCA, the prcomp function is used with the data centered and scaled, so that features with large magnitudes do not dominate features with small magnitudes.
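A minimal sketch of why centering and scaling matter for `prcomp`, using synthetic features rather than the musk data. The matrix `X` and its column names are made up for illustration; only the `prcomp` arguments `center` and `scale.` are taken from the function itself.

```r
set.seed(1)

# Synthetic features on very different scales (stand-in for the real features)
X <- cbind(
  small = rnorm(50, sd = 1),
  mid   = rnorm(50, sd = 30),
  large = rnorm(50, sd = 1000)
)

# PCA without and with scaling to unit variance
pca_raw    <- prcomp(X, center = TRUE, scale. = FALSE)
pca_scaled <- prcomp(X, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component
prop_raw    <- pca_raw$sdev^2 / sum(pca_raw$sdev^2)
prop_scaled <- pca_scaled$sdev^2 / sum(pca_scaled$sdev^2)

print(round(prop_raw, 4))     # dominated by the large-magnitude feature
print(round(prop_scaled, 4))  # variance spread more evenly across components
```

Without scaling, the first component is almost entirely the `large` column; with scaling, each feature contributes on an equal footing.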
Next, multidimensional scaling (MDS) is applied to distance matrices of the musk data computed with several different distance measures. Shepard plots show the goodness of fit of each MDS solution. As the figures show, the Euclidean and Minkowski distances give the better results. In these cases the red points are more widely scattered yet remain close to one another, whereas the blue points are concentrated in the middle of the plot.
## initial value 3.984595
## final value 3.984595
## converged
## initial value 0.736520
## final value 0.736513
## converged
## initial value 0.736520
## final value 0.736513
## converged
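The MDS step described above can be sketched as follows. This is an illustration on random data, assuming `MASS::isoMDS` and `MASS::Shepard` are the functions involved (the `initial value` / `final value` / `converged` lines match what `isoMDS` prints); the choice of metrics and of `p = 3` is hypothetical.

```r
library(MASS)

set.seed(1)
X <- scale(matrix(rnorm(40 * 5), nrow = 40))  # 40 toy instances, 5 features

# Distance matrices under different metrics from stats::dist
d_euc  <- dist(X, method = "euclidean")
d_man  <- dist(X, method = "manhattan")
d_min2 <- dist(X, method = "minkowski")         # default p = 2
d_min3 <- dist(X, method = "minkowski", p = 3)  # a genuinely different metric

# Non-metric MDS down to two dimensions (trace = FALSE silences the log)
fit <- isoMDS(d_euc, k = 2, trace = FALSE)

# Shepard diagram: original distances against distances in the 2-D layout
sh <- Shepard(d_euc, fit$points)
plot(sh, pch = 20, xlab = "Original distance", ylab = "Configuration distance")
lines(sh$x, sh$yf, type = "S", col = "red")
```

If the Minkowski matrix in the report was built with the default `p = 2` of `dist`, it coincides with the Euclidean matrix, which would explain why two of the three runs report identical values.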
In the second part of the first task, each bag is reduced to a single vector by taking the mean of its instances. With this reduction the PCA improves: the cumulative proportion of variance approaches 1 more rapidly. The MDS values also improve, with the final value for two of the three runs dropping from about 0.737 to about 0.073.
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 6.6528 5.6257 4.6908 4.2063 3.10049 2.75154 2.24499
## Proportion of Variance 0.2666 0.1907 0.1326 0.1066 0.05791 0.04561 0.03036
## Cumulative Proportion 0.2666 0.4573 0.5898 0.6964 0.75433 0.79993 0.83030
## PC8 PC9 PC10 PC11 PC12 PC13
## Standard deviation 1.88542 1.80613 1.60162 1.41639 1.29688 1.2561
## Proportion of Variance 0.02141 0.01965 0.01545 0.01209 0.01013 0.0095
## Cumulative Proportion 0.85171 0.87136 0.88681 0.89890 0.90903 0.9185
## PC14 PC15 PC16 PC17 PC18 PC19
## Standard deviation 1.23131 1.05946 1.02942 0.93512 0.91230 0.86691
## Proportion of Variance 0.00913 0.00676 0.00638 0.00527 0.00501 0.00453
## Cumulative Proportion 0.92767 0.93443 0.94081 0.94608 0.95110 0.95562
## PC20 PC21 PC22 PC23 PC24 PC25
## Standard deviation 0.79297 0.78039 0.74704 0.6937 0.65812 0.63570
## Proportion of Variance 0.00379 0.00367 0.00336 0.0029 0.00261 0.00243
## Cumulative Proportion 0.95941 0.96308 0.96644 0.9693 0.97195 0.97438
## PC26 PC27 PC28 PC29 PC30 PC31
## Standard deviation 0.62054 0.56340 0.53876 0.5149 0.50515 0.48016
## Proportion of Variance 0.00232 0.00191 0.00175 0.0016 0.00154 0.00139
## Cumulative Proportion 0.97670 0.97862 0.98036 0.9820 0.98350 0.98489
## PC32 PC33 PC34 PC35 PC36 PC37
## Standard deviation 0.46820 0.44470 0.42894 0.40874 0.39965 0.38054
## Proportion of Variance 0.00132 0.00119 0.00111 0.00101 0.00096 0.00087
## Cumulative Proportion 0.98621 0.98740 0.98851 0.98951 0.99048 0.99135
## PC38 PC39 PC40 PC41 PC42 PC43
## Standard deviation 0.3643 0.36024 0.32250 0.31709 0.29101 0.28987
## Proportion of Variance 0.0008 0.00078 0.00063 0.00061 0.00051 0.00051
## Cumulative Proportion 0.9921 0.99293 0.99356 0.99416 0.99467 0.99518
## PC44 PC45 PC46 PC47 PC48 PC49
## Standard deviation 0.27843 0.2579 0.25075 0.22941 0.21491 0.21031
## Proportion of Variance 0.00047 0.0004 0.00038 0.00032 0.00028 0.00027
## Cumulative Proportion 0.99565 0.9960 0.99643 0.99674 0.99702 0.99729
## PC50 PC51 PC52 PC53 PC54 PC55
## Standard deviation 0.19940 0.19567 0.18698 0.1828 0.17357 0.16440
## Proportion of Variance 0.00024 0.00023 0.00021 0.0002 0.00018 0.00016
## Cumulative Proportion 0.99753 0.99776 0.99797 0.9982 0.99835 0.99851
## PC56 PC57 PC58 PC59 PC60 PC61
## Standard deviation 0.15556 0.15071 0.14823 0.13348 0.1277 0.12491
## Proportion of Variance 0.00015 0.00014 0.00013 0.00011 0.0001 0.00009
## Cumulative Proportion 0.99866 0.99880 0.99893 0.99904 0.9991 0.99923
## PC62 PC63 PC64 PC65 PC66 PC67
## Standard deviation 0.11920 0.11037 0.09867 0.09513 0.09437 0.09016
## Proportion of Variance 0.00009 0.00007 0.00006 0.00005 0.00005 0.00005
## Cumulative Proportion 0.99931 0.99939 0.99945 0.99950 0.99955 0.99960
## PC68 PC69 PC70 PC71 PC72 PC73
## Standard deviation 0.08776 0.08368 0.07755 0.07694 0.07280 0.06549
## Proportion of Variance 0.00005 0.00004 0.00004 0.00004 0.00003 0.00003
## Cumulative Proportion 0.99965 0.99969 0.99973 0.99976 0.99979 0.99982
## PC74 PC75 PC76 PC77 PC78 PC79
## Standard deviation 0.06194 0.06145 0.05420 0.05184 0.04933 0.04641
## Proportion of Variance 0.00002 0.00002 0.00002 0.00002 0.00001 0.00001
## Cumulative Proportion 0.99984 0.99987 0.99988 0.99990 0.99992 0.99993
## PC80 PC81 PC82 PC83 PC84 PC85
## Standard deviation 0.04430 0.04222 0.04009 0.03499 0.03489 0.03129
## Proportion of Variance 0.00001 0.00001 0.00001 0.00001 0.00001 0.00001
## Cumulative Proportion 0.99994 0.99995 0.99996 0.99997 0.99998 0.99998
## PC86 PC87 PC88 PC89 PC90 PC91
## Standard deviation 0.02695 0.02536 0.02481 0.02177 0.0201 0.01676
## Proportion of Variance 0.00000 0.00000 0.00000 0.00000 0.0000 0.00000
## Cumulative Proportion 0.99999 0.99999 0.99999 1.00000 1.0000 1.00000
## PC92
## Standard deviation 7.013e-16
## Proportion of Variance 0.000e+00
## Cumulative Proportion 1.000e+00
## initial value 4.024334
## final value 4.024334
## converged
## initial value 0.073393
## final value 0.073373
## converged
## initial value 0.073393
## final value 0.073373
## converged
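The bag-level reduction used in this part can be sketched as below. The toy data frame and its column names (`label`, `bag`, `f1`, `f2`) are hypothetical stand-ins for the musk table, and `aggregate` is just one of several ways to average within groups.

```r
set.seed(1)

# Toy stand-in: 3 bags with 3, 2 and 4 instances and two features
toy <- data.frame(
  label = rep(c(1, 0, 1), times = c(3, 2, 4)),
  bag   = rep(1:3,        times = c(3, 2, 4)),
  f1    = rnorm(9),
  f2    = rnorm(9)
)

# One row per bag: average every feature over the instances of the bag
bag_means <- aggregate(cbind(f1, f2) ~ bag + label, data = toy, FUN = mean)
bag_means <- bag_means[order(bag_means$bag), ]

# PCA on the bag-level vectors, centered and scaled as before
pca_bags <- prcomp(bag_means[, c("f1", "f2")], center = TRUE, scale. = TRUE)
print(summary(pca_bags))
```

After averaging, the real data has only 92 rows, so the centered matrix has rank at most 91. This is consistent with PC92 in the output above having a standard deviation of about 7e-16.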
#TASK 2
In this part, image processing is carried out. First, an anime image is read and noise is added to it. After the color image is converted to grayscale, 3x3 patches are extracted and PCA is performed on them. The image can then be re-plotted using the first principal component.
## Standard deviations (1, .., p=9):
## [1] 0.82942284 0.22939386 0.19443172 0.14504610 0.13750839 0.07754772
## [7] 0.05479875 0.04775727 0.03548310
##
## Rotation (n x k) = (9 x 9):
## PC1 PC2 PC3 PC4 PC5
## [1,] -0.3265218 -0.37642247 0.431648049 -0.2486016 -0.1936102
## [2,] -0.3397919 0.04917718 0.422102838 -0.2128238 0.4693133
## [3,] -0.3271763 0.42167753 0.366805668 -0.2538682 -0.2421919
## [4,] -0.3317644 -0.42929957 0.034382258 0.4447482 -0.2728940
## [5,] -0.3464645 0.01115158 -0.001670066 0.5129299 0.4528968
## [6,] -0.3341491 0.41409718 -0.032337218 0.4426103 -0.2966252
## [7,] -0.3247961 -0.43607100 -0.365410958 -0.2533659 -0.2235271
## [8,] -0.3397860 -0.02681887 -0.425401246 -0.2128686 0.4686131
## [9,] -0.3288963 0.36235295 -0.429232740 -0.2492888 -0.2155632
## PC6 PC7 PC8 PC9
## [1,] 0.507872321 -0.2794145018 0.31577278 0.1742490
## [2,] -0.016400633 0.5639243309 -0.03382133 -0.3449346
## [3,] -0.491288354 -0.3213517864 -0.28110962 0.1839497
## [4,] 0.002250105 -0.0306776455 -0.56612608 -0.3327688
## [5,] -0.010460559 0.0003755909 0.00199389 0.6414897
## [6,] 0.003009552 0.0303605266 0.56521354 -0.3339680
## [7,] -0.491683032 0.3204387420 0.28091771 0.1825317
## [8,] -0.015060312 -0.5636411475 0.03367101 -0.3448142
## [9,] 0.508261662 0.2800260049 -0.31639136 0.1755747
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 0.8294 0.22939 0.1944 0.14505 0.13751 0.07755
## Proportion of Variance 0.8280 0.06333 0.0455 0.02532 0.02276 0.00724
## Cumulative Proportion 0.8280 0.89131 0.9368 0.96213 0.98489 0.99213
## PC7 PC8 PC9
## Standard deviation 0.05480 0.04776 0.03548
## Proportion of Variance 0.00361 0.00275 0.00152
## Cumulative Proportion 0.99574 0.99848 1.00000
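The patch-based PCA can be sketched on a small synthetic grayscale matrix. The toy image, its size `n`, and the noise level are assumptions for illustration; the report's actual image loading and noise step are not shown, and the output above suggests the patches were not scaled, so `scale. = FALSE` is used here.

```r
set.seed(1)

# Toy grayscale image in [0, 1]: a smooth gradient plus uniform noise
n   <- 20
img <- outer(seq_len(n), seq_len(n), function(i, j) (i + j) / (2 * n))
img <- pmin(pmax(img + matrix(runif(n * n, -0.1, 0.1), n, n), 0), 1)

# One 3x3 patch per interior pixel, flattened to a row of 9 values
centers <- expand.grid(i = 2:(n - 1), j = 2:(n - 1))
patches <- t(apply(centers, 1, function(cc) {
  as.vector(img[(cc[1] - 1):(cc[1] + 1), (cc[2] - 1):(cc[2] + 1)])
}))

# PCA on the patches (9 variables, one per patch position)
pca <- prcomp(patches, center = TRUE, scale. = FALSE)

# Score of each patch on PC1, laid back out on the pixel grid
pc1_img <- matrix(pca$x[, 1], nrow = n - 2, ncol = n - 2)

# Rescale to [0, 1] and display as a grayscale image
pc1_img <- (pc1_img - min(pc1_img)) / (max(pc1_img) - min(pc1_img))
image(pc1_img, col = gray(seq(0, 1, length.out = 256)), axes = FALSE)
```

In the rotation matrix above, all nine PC1 loadings are close to -1/3, so the first component is essentially a (sign-flipped) average over each patch. That is why re-plotting the PC1 scores gives a smoothed version of the image.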